1. INTRODUCTION

The human gaze is a natural cue that provides rich information about the attention of individuals in
social interactions. Human beings receive and communicate a variety of information through their eyes:
the eye points to the object being analyzed during a recognition or learning task, and looking at an
interlocutor expresses interest in the discussion. Based on this observation, and with the objective of
improving the teaching and learning experience, we have chosen to analyze one aspect of the human eye, the
direction of gaze, to extract information on the attention of students in class during a course [1]. The
operation must be carried out under normal course conditions, so the devices used must be neither
distracting nor invasive for the students [2]. To achieve this, we use a single camera as the source of
information and deploy a method for 3D gaze estimation based only on the face images provided
by the camera.

The present paper proposes a system for monitoring and measuring a student’s attention during a
normal class session, based on his point of gaze. This system uses an inexpensive camera and a computer to
gauge the student’s attention throughout the course. The field of gaze analyzed is the plane that contains
the display board, so a student is said to be attentive if his gaze points to an area in this plane. Analysis of the
information collected will determine the extent to which the teaching materials and the teaching style
attract the attention of the student. It will also make it possible to identify distracting objects and the
moments when students lose attention.

The paper is organized as follows. Section 2 reviews existing gaze estimation approaches and data
sets. Section 3 explains the proposed approach and the details of our system. Section 4 discusses the
results obtained by our system compared with those of similar systems. Finally, section 5 ends with some
conclusions and perspectives.


2. LITERATURE REVIEW 

In recent years, a wide variety of remote gaze tracking algorithms have been reported in the
literature. For our use case, the general knowledge base can be divided into three categories:
appearance-based, model-based and cross-ratio based. In this section, we present these three categories of
gaze estimation and the datasets that were created for this purpose.


2.1. Gaze estimation: appearance-based methods 

Choi et al. [3] used convolutional neural networks (CNNs) to perform head pose estimation with
categorization of driver gaze areas (central rear-view mirror, left and right part of the windscreen, left
window). They built their own data set of male and female drivers, including subjects wearing
glasses. Their system achieves an accuracy of 95%.

In Konrad et al. [4], gaze tracking was carried out in a very constrained environment: the camera was
placed at a distance of 51 cm from the individual’s face. To train their CNN, the authors built a data
set captured in their particular setup and composed of images of 5 subjects. The results are promising;
however, the CNN needs a lot of data to be properly trained.

In George and Routray [5], the proposed algorithm detects the faces present in the image
using a modified version of the Viola-Jones algorithm; the rough eye region is then obtained using geometric
relations and facial landmarks. A convolutional neural network is then used for gaze direction classification.
This algorithm was tested on the Eye Chimera data set and performed well in terms of computational
complexity, which makes it a good choice for smart devices.

Vora et al. [6] use two distinct CNN architectures (AlexNet and VGG16) to classify the
driver’s gaze into seven zones. These two CNNs were fine-tuned on a dataset created for this purpose,
which contains 47,515 images from naturalistic driving tests of 11 drives, driven by 10 subjects in two
different cars and labeled with 6 gaze zones. The study compares the performance of
the two architectures; the authors found that VGG16 outperformed AlexNet, which they attribute to the small
size of the kernel (3x3) in the convolution layers. They also showed that using the upper part of the face as
input works better than using the whole face. Their system achieved an accuracy of 93.36%.

A low-cost gaze-based text entry method was proposed by Zhang et al. [7]. This approach
aims to help people with disabilities communicate by text using eye movement. The authors classify
the gaze into 9 directions corresponding to the input method of a T9 keyboard. The letter
entered is confirmed by blinking. To achieve this goal, they built a CNN to estimate gaze in 9 directions,
trained on a large-scale data set they created with eye images of 25 people. According to the results, this
model can estimate the gaze of different people in various lighting conditions, with an accuracy of 95.01%.


2.2. Gaze estimation: model-based methods 

Model-based techniques perform gaze estimation by combining a geometric model of the eye with
eye features, such as the corneal reflection and the pupil center [8], [9]. Using the geometric model, these
methods attempt to estimate the center of the cornea and the optical and visual axes of the eye. The direction
of gaze is then determined by the visual axis, which goes through the center of the cornea and the fovea.
Unlike older methods, which used infrared illumination and high-resolution cameras to extract features
(eyeball), recent methods rely on machine learning approaches that achieve high feature extraction accuracy
from simple webcam images and under varying lighting conditions [10].


2.3. Cross-ratio based gaze estimation 

Cross-ratio (CR) based methods are invariant to head pose changes. They perform gaze
estimation by projecting a known rectangular pattern of near-infrared (NIR) lights onto the eye of the user and
using the invariance property of projective geometry. Yoo and Chung [11] carried out interesting work on
the experimental verification of cross-ratio based methods.

The various methods mentioned above have advantages and disadvantages. Model-based
methods allow gaze estimation in the presence of user head movement, but they require
specific hardware (multiple cameras and several light sources).

Appearance-based methods learn to map the eye appearance directly to the human gaze. This has
become possible thanks to advances in deep learning techniques, which allow the extraction of the shape and
texture properties of the eyes, and to the creation of several eye gaze estimation data sets. These methods
have low hardware requirements, which makes them suitable for implementation on platforms without a
high-resolution camera or additional light sources [12]. However, they do not cope well with
varying head positions and illumination changes. This is because the appearance of the eyes
may look similar under different head poses and gazing directions, while changes in illumination (under the
same pose) can change the appearance of the eye and affect the gaze estimation accuracy [12], [13].

Cross-ratio based methods do not require a hardware calibration mapping the position of the camera
to the monitor, and they allow free head motion. However, the distance from the user greatly affects their
performance, and prolonged projection of infrared light can tire the user [14], [15].


2.4. Data sets for gaze estimation 

Several large-scale gaze estimation datasets have been created in recent years, among which a good 
portion are publicly available. Some of these datasets have been constructed using images captured in labs 
under particular setups, while others were acquired in outdoor environments. Table 1 summarizes a 
comparative study of the common datasets used in gaze estimation work. 


Table 1. Summary of some common gaze estimation datasets

| Dataset | Year | Total samples | Subjects | Purpose | Configuration |
|---|---|---|---|---|---|
| CAVE-DB [13] | 2013 | 5880 | 56 | Gaze estimation | Collected in laboratory conditions; 5 different horizontal head poses, 21 gaze directions for each subject and head pose. |
| Eyediap [16] | 2014 | 94 videos | 16 | Gaze estimation | Collected in laboratory conditions; free head pose. |
| UT Multiview [17] | 2014 | 64000 | 50 | Gaze estimation | Collected in laboratory; 8 head poses and 20 gaze directions per head pose. |
| MPIIGaze [18] | 2015 | 213659 | 15 | Evaluating gaze tracking methods | Images captured by laptop cameras in daily life; free head pose and variable illumination conditions. |
| GazeCapture [19] | 2016 | 2445504 | 1474 | Gaze estimation | Images captured by a mobile phone camera under different variations in head pose and illumination. |
| MPIIFaceGaze [20] | 2017 | 45k | 15 | Appearance-based gaze estimation | Collected using a laptop camera; free head pose. |
| RT-Gene [21] | 2018 | 123k | 15 | Appearance-based gaze estimation | Collected in laboratory; free head pose; annotated with a mobile eye tracker; a GAN is used to remove the eye tracker from face images [22]. |
| Nvgaze [23] | 2019 | 4.5M | 30 | Near-eye gaze estimation | Collected under laboratory conditions, using infrared illumination. |
| ETH-XGaze [24] | 2020 | 1.1M | 110 | Gaze estimation | Collected in the laboratory; high-resolution images, different head poses and different gazes. |
| EVE [25] | 2020 | 4.2k videos | 54 | Gaze estimation | Collected in laboratory; different gazes, different head poses with annotation. |


3. OUR APPROACH 

The system we present in this article is essentially based on a low-cost camera and software that
implements our algorithms and methods for monitoring the student’s gaze point. The proposed system consists
of a camera placed above a display board which serves as a means of illustration or as a projection surface
(see Figure 1). The students sit at a distance of 150 cm in front of the board and follow the teacher’s
explanations. The goal is to follow each student’s gaze and detect cases where he looks away from the
projection or illustrations displayed on the board.


3.1. Distance from the camera estimation 

The camera captures images of the student at a rate of 30 frames per second with a resolution of
1280 × 720 px. The system first detects the student’s face and calculates the distance between the student and
the camera, which is taken as the origin of the world coordinate system used to precisely define the
student’s gaze point. Then, it extracts the 2D image coordinates of the features of the region of interest,
such as the iris and the inner and outer corners of the eye, to predict the center of the eyeball and calculate
the gaze vector. Finally, the gaze vector is transformed from image coordinates to world
coordinates to find its point of intersection with the plane that contains the camera and
the display board. This process is illustrated in Figure 2.


Figure 1. General overview of the concept of monitoring the students’ gaze in the classroom using a simple
camera


[Figure 2 labels: camera sensor; camera-to-object distance; face box]


Figure 2. Principle of pinhole camera


In this article, we use a method based on a single camera to calculate the distance between the
student and the stationary camera. Figure 2 shows the experimental setup and the operating principle of this
method, which is based on triangle similarity. The segment AB represents the width in centimeters of the
object placed in front of the camera, ab is the width in pixels of the projection of the object on the
complementary metal-oxide-semiconductor (CMOS) sensor of the camera, and α is the angle between the
optical axis and the end A of the object. Camera calibration is required; it should be done using images of an
object captured by the camera at different angles in three-dimensional space (x, y, z). In our scenario, the
object is the face of the student, which is detected using an appropriate detection algorithm. Once the values
of AB, ab, and Distance are measured, it remains to deduce the value of the focal length (FL) using:


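$$\tan\alpha = \frac{AB}{2\,\mathrm{Distance}} \quad\text{and}\quad \tan\alpha = \frac{ab}{2\,FL}$$

$$\frac{AB}{2\,\mathrm{Distance}} = \frac{ab}{2\,FL}$$

$$FL = \frac{ab \times \mathrm{Distance}}{AB}$$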


Note that Distance and AB are measured in centimeters and ab in pixels. Once the camera is
calibrated and FL is calculated, we can calculate the distance from the student to the camera using
triangle similarity:


$$\mathrm{Distance} = \frac{FL \times AB}{ab} \qquad (1)$$

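The calibration relation and equation (1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the numeric values (a 14 cm wide face seen as 120 px at 100 cm) are hypothetical.

```python
def focal_length_px(face_width_cm, face_width_px, distance_cm):
    """Calibration step: FL = ab * Distance / AB (result in pixels)."""
    return face_width_px * distance_cm / face_width_cm


def distance_to_camera_cm(focal_length, face_width_cm, face_width_px):
    """Equation (1): Distance = FL * AB / ab (result in centimeters)."""
    return focal_length * face_width_cm / face_width_px


# Hypothetical calibration capture: a 14 cm wide face is 120 px wide at 100 cm.
fl = focal_length_px(14.0, 120.0, 100.0)

# Later frame: the same face now spans only 80 px, so it is farther away.
estimated = distance_to_camera_cm(fl, 14.0, 80.0)
```

Because the relation is a simple proportion, the estimate is only as good as the measured face width AB and the pixel width returned by the face detector.
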
3.2. Face detection and eye region localization 

As in most computer vision systems that deal with facial emotion, head pose, or
gaze estimation, the first stage of our system is face detection. There are a multitude of techniques
for face detection [26]. However, the best performing approaches are those based on CNNs, because
they have shown very good results in extracting features from convolution layers [27]. CNNs made it possible
to perform face detection and facial feature extraction jointly, which reduces computation time and improves
extraction accuracy. Figure 3 presents the steps followed by our proposed system. We used Dlib
[28] for the detection and extraction of facial features. This library is open source and has shown remarkable
performance in detecting faces and extracting face landmarks.
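As an illustration of eye region localization, the sketch below crops a padded box around one eye from a list of facial landmarks. It assumes the 68-point iBUG 300-W layout used by Dlib's pretrained shape predictor; the source does not state which landmark model the system uses, and the function name and `margin` parameter are hypothetical.

```python
# Eye contour indices in the 68-point iBUG 300-W layout (an assumption here):
# points 36-41 outline the subject's right eye, points 42-47 the left eye.
RIGHT_EYE = range(36, 42)
LEFT_EYE = range(42, 48)


def eye_box(landmarks, indices, margin=0.5):
    """Return (x_min, y_min, x_max, y_max) around one eye.

    landmarks: sequence of 68 (x, y) pixel coordinates.
    margin: padding on every side, as a fraction of the eye width.
    """
    xs = [landmarks[i][0] for i in indices]
    ys = [landmarks[i][1] for i in indices]
    pad = margin * (max(xs) - min(xs))
    return (min(xs) - pad, min(ys) - pad, max(xs) + pad, max(ys) + pad)
```

In practice the landmark list would come from a face detector followed by a landmark predictor; the resulting box can then be cropped and passed to the gaze estimation stage.
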


[Figure 3 flowchart steps recoverable from the source: (...) coordinates system to world coordinates system;
calculate the intersection point between the gaze vector and the display board plane; Attention = 1;
Attention = 0]


Figure 3. Flow diagram of the proposed system


3.3. Gaze estimation 

Since gaze estimation is a crucial step in our system, we reviewed a number of existing
methods that best match our use case and hardware constraints. We chose the method of Park et
al. [29], which outperforms state-of-the-art results on real-world eye images. The main idea behind this
method is to build an accurate eye landmark detector, which then allows the gaze to be
estimated. The data used to generate, test, and evaluate the eye landmark detection model come from 4
datasets: EYEDIAP, MPIIGaze+, UT Multiview, and Columbia. The detection of facial landmarks and skeletal
joints is a well-researched subject: several studies have proposed deep convolutional neural
network architectures to solve it [30], [31]. The authors
adapted the hourglass architecture [32] to the facial landmark detection task. This architecture was originally
applied to human pose estimation, with the aim of solving the recurrent issue of occlusion of a part of the
body by a hand or an object.

The region of interest is the eye, which contains fewer overall structuring elements than in pose 
estimation. This peculiarity allowed the detection with reasonable precision of the center of the eyeball and 
the iris edge landmarks in cases of occlusion. Using the eye landmarks that the system provided, the authors 
propose two scenarios for gaze estimation: 


3.3.1. Feature-based gaze estimation 

The feature vector is built from 17 landmark coordinates (8 of the limbus, 8 of the edge of the iris and 1
of the center of the iris) and a 2D gaze direction. The landmark coordinates are normalized by the width of
the eye, c2 − c1, where c2 and c1 are respectively the outer and inner corners of the eye, and are expressed
in a coordinate system centered at c1. The 2D gaze direction is obtained by subtracting the center of the
eyeball from the center of the iris. A support vector regression (SVR) model is then trained on the resulting
36 landmark-based features (17 landmarks with 2 coordinates each, plus the 2 components of the gaze
direction) to estimate a 3D gaze direction represented by the pitch and yaw of the eyeball.

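The construction of the 36-dimensional feature vector can be sketched as follows. This is an illustrative reading of the description above, not the code of Park et al.: the function name is hypothetical, and dividing the 2D gaze direction by the eye width is an assumption that the text does not confirm.

```python
import math


def landmark_features(landmarks, inner_corner, outer_corner,
                      iris_center, eyeball_center):
    """Build 36 features: 17 normalized (x, y) landmarks + a 2D gaze direction.

    All inputs are (x, y) pixel coordinates in the eye image.
    """
    if len(landmarks) != 17:
        raise ValueError("expected 17 landmarks")
    eye_width = math.hypot(outer_corner[0] - inner_corner[0],
                           outer_corner[1] - inner_corner[1])
    features = []
    for x, y in landmarks:
        # Coordinates centered at the inner corner c1, scaled by the eye width.
        features.append((x - inner_corner[0]) / eye_width)
        features.append((y - inner_corner[1]) / eye_width)
    # 2D gaze direction: iris center minus eyeball center
    # (scaling by the eye width is an assumption made for this sketch).
    features.append((iris_center[0] - eyeball_center[0]) / eye_width)
    features.append((iris_center[1] - eyeball_center[1]) / eye_width)
    return features
```

Vectors built this way could then be fed to a regressor, with one model for pitch and one for yaw if the chosen SVR implementation predicts a single target.
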


3.3.2. Model-based gaze estimation

Two intersecting spheres are used to model the human eyeball: a large one for the eyeball itself and a
small one for the corneal bulge. The data available to the system are
the 8 landmarks of the edge of the iris predicted from the eye image, the eyeball center landmark and the iris
center landmark. The radius of the eyeball is estimated in pixels from the coordinates of the eyeball center
and those of the iris. The coordinates of the iris points are then given by:


$$u_{ij} = x_{ij} = x_c - r_{xy}\cos\theta'_j \sin\phi'_j$$

$$v_{ij} = y_{ij} = y_c + r_{xy}\sin\theta'_j$$


For model-based gaze estimation, the four parameters θ, φ, δ and γ are unknown: (θ, φ) is the gaze
direction, δ the angular iris radius and γ the angular offset equivalent to eye roll. The already detected
landmarks are used to solve this problem. The authors proposed the use of an iterative optimization method
such as the conjugate gradient method.

The gaze vector in image coordinates is then deduced. Its starting point is the center of the iris
u(xu, yu), in pixels, and its ending point is v(xv, yv), where:


$$x_v = x_u + c\,\cos\theta\,\sin\phi$$

$$y_v = y_u + c\,\sin\theta$$

The coordinates obtained are in pixels, in the image coordinate system. To get the point of gaze, we first
have to transform the coordinates of the gaze vector from the image coordinate system (xp, yp) to the
real-world coordinate system (X, Y, Z), whose origin O is the camera and in which the display board lies in the
(X, Y) plane. The Z axis represents the depth between the plane (board, camera) and the student. The gaze
point is then obtained by calculating the intersection between the gaze vector and the plane of the display
board.


3.3.3. Camera calibration 

Cameras are described by an imaging model called the pinhole, or perspective projection, model.
This model describes the projection of the points Pw of a 3D scene onto a 2D plane made up of pixels, which
is the image. This can be expressed by:


$$M = K\,[R \mid t]$$

where K is the matrix that contains the intrinsic parameters and [R|t] is the extrinsic matrix, made up of the
rotation matrix R and the translation vector t that define the change of coordinates from the real-world
coordinate system to that of the camera.


$$K = \begin{bmatrix} f_x & \gamma & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$


The components of the intrinsic parameter matrix are as follows: fx and fy represent the
focal length along the x and y axes, expressed in pixels; γ is the skew between the axes, which is generally
equal to 0; cx and cy are the coordinates in pixels of the intersection of the optical axis and the image plane.
The relation between a point Pw in the world coordinate system (X, Y, Z) and its image projection Pc (xc, yc)
in the image coordinate system is given by:


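$$P_c = \begin{bmatrix} x_c \\ y_c \\ 1 \end{bmatrix} = M\,P_w = M \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}$$

Here Pw is written in homogeneous coordinates, and the equality holds up to a scale factor.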
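The projection can be checked numerically with a small pure-Python sketch. The intrinsic values below (a focal length of 1000 px and a principal point at the center of a 1280 × 720 image) are hypothetical, as is the function name; the division by the third homogeneous component is the scale factor of the pinhole model.

```python
def project_point(K, R, t, point_w):
    """Project a 3D world point to pixel coordinates with M = K [R | t]."""
    # World -> camera coordinates: P_cam = R * P_w + t
    cam = [sum(R[i][j] * point_w[j] for j in range(3)) + t[i] for i in range(3)]
    # Camera -> homogeneous image coordinates: K * P_cam
    hom = [sum(K[i][j] * cam[j] for j in range(3)) for i in range(3)]
    if hom[2] == 0:
        raise ValueError("point lies in the camera plane and cannot be projected")
    # Divide by the homogeneous scale to obtain pixel coordinates.
    return hom[0] / hom[2], hom[1] / hom[2]


# Hypothetical intrinsics: fx = fy = 1000 px, zero skew, principal point (640, 360).
K = [[1000.0, 0.0, 640.0],
     [0.0, 1000.0, 360.0],
     [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # camera aligned with world
t = [0.0, 0.0, 0.0]

pixel = project_point(K, R, t, (15.0, -7.5, 150.0))
```

A point on the optical axis projects to the principal point (cx, cy), which is a quick sanity check for any calibrated K.
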


Determining the set of camera parameters that describe the mapping between the 3D reference
coordinates and those of the 2D image is the primary goal of an operation called camera calibration [33]. It
involves analyzing the image projections of a series of characteristic points belonging to an object whose
three-dimensional coordinates are known with great precision
[34]. The literature offers several families of calibration methods, based on calibration patterns, geometric
clues or deep learning. In this study we chose a method belonging to the calibration pattern family,
because of its high precision and because it is implemented in several development environments,
such as the OpenCV library for Python.
The camera calibration algorithm is as follows:
a. A chessboard of (10 × 7) squares is used as a calibration pattern; a series of 40 captures of the chessboard
is taken from different angles (see Figure 4).
b. Detect the chessboard and locate the pixel coordinates of the corners of the black squares in the images.
c. Solve for the intrinsic and extrinsic parameters of the camera.
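The three steps above can be sketched with OpenCV's chessboard calibration functions. This is an illustrative outline under stated assumptions, not the authors' script: it assumes that (10 × 7) counts squares, so the detector looks for 9 × 6 inner corners, and the function names, the `image_paths` argument and the square size are hypothetical. OpenCV and NumPy are imported inside `calibrate` so that the helper for the reference points can be used on its own.

```python
def chessboard_object_points(squares_x=10, squares_y=7, square_size=1.0):
    """3D coordinates (Z = 0) of the inner corners of the chessboard, row by row."""
    cols, rows = squares_x - 1, squares_y - 1
    return [(c * square_size, r * square_size, 0.0)
            for r in range(rows) for c in range(cols)]


def calibrate(image_paths, squares_x=10, squares_y=7, square_size=1.0):
    """Steps a-c: detect the chessboard in each capture, then solve for the camera."""
    import cv2
    import numpy as np

    pattern = (squares_x - 1, squares_y - 1)  # inner corners per row and column
    template = np.array(
        chessboard_object_points(squares_x, squares_y, square_size),
        dtype=np.float32)
    object_points, image_points, image_size = [], [], None
    for path in image_paths:
        image = cv2.imread(path)
        if image is None:
            continue  # unreadable file: skip it
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        found, corners = cv2.findChessboardCorners(gray, pattern)
        if found:
            object_points.append(template)
            image_points.append(corners)
            image_size = gray.shape[::-1]  # (width, height)
    if not object_points:
        raise ValueError("no chessboard detected in the provided images")
    rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
        object_points, image_points, image_size, None, None)
    return rms, K, dist, rvecs, tvecs
```

The returned RMS reprojection error gives a quick indication of calibration quality; captures in which the board is not detected are simply skipped.
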


Figure 4. Sample of the chessboard images used in camera calibration 


3.4. Estimating gaze point coordinate in the world coordinate system 

After the camera calibration step, which gives us the intrinsic and extrinsic parameters of the
camera, we can calculate the coordinates of the gaze vector in the real-world coordinate system.
We can then determine the point of intersection between the gaze vector and the plane that contains the
display board.

As shown in Figure 1, we assume that the camera is the center O of the real-world
coordinate system, that the display board is contained in the XOY plane, and that the Z axis coincides with
the optical axis of the camera, with the positive direction pointing towards the front of the camera.
Therefore, the 2D coordinates of the gaze point on the XOY plane can be obtained by switching from the image
coordinate system to the real-world coordinate system and imposing Z = 0.


$$U_x = -(x_{ic} - c_x)\,\frac{D}{f_x}$$

$$U_y = (y_{ic} - c_y)\,\frac{D}{f_y}$$

$$V_x = -s\,\sin\phi + U_x$$

$$V_y = s\,\sin\theta + U_y$$

where

$$s = \frac{D}{\cos\theta} \qquad (2)$$
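The intersection of the gaze vector with the board plane Z = 0 can also be written as a generic ray-plane intersection. The sketch below is not the authors' exact formula: the sign convention of the gaze direction (pitch θ positive upwards, yaw φ positive towards −X, gaze pointing towards decreasing Z) and the function name are assumptions made for illustration.

```python
import math


def gaze_point_on_board(eye_world, pitch, yaw):
    """Intersect the gaze ray with the plane Z = 0 that contains the board.

    eye_world: (X, Y, Z) position of the eye in world coordinates, Z > 0.
    Returns (X, Y) on the board plane, or None if the gaze points away from it.
    """
    ux, uy, uz = eye_world
    # Unit gaze direction under the assumed convention.
    dx = -math.cos(pitch) * math.sin(yaw)
    dy = math.sin(pitch)
    dz = -math.cos(pitch) * math.cos(yaw)
    if dz >= 0:
        return None  # the ray never reaches the plane Z = 0
    step = -uz / dz  # ray parameter at which Z becomes 0
    return ux + step * dx, uy + step * dy
```

For small angles this construction and a single-scale formula such as equation (2) give similar results; the ray-plane form has the advantage of flagging gazes that never reach the board.
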


3.5. Student attention quantification 

During the class session, a student can only be in one of two states: attention or inattention. The
student’s attention can thus be modeled by a binary function A(t) ∈ {0, 1}, whose value depends on the gaze
point. We consider the student attentive if his gaze point lies in the region P of the board plane whose
center is the camera and whose extents along the X and Y axes are respectively the length and width of the
display board. In this case, the level of attention is 1; in the opposite case, where the gaze is completely
outside P, it is 0.


$$\forall t:\quad A(t) = \begin{cases} 1 & \text{if } G_t(x, y) \in P \\ 0 & \text{otherwise} \end{cases}$$

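The attention rule reduces to a point-in-rectangle test. In this minimal sketch the region P is taken to be a rectangle centered at the origin (the camera), with the board length along X and the board width along Y; the function name and the handling of a missing gaze point are assumptions.

```python
def attention_state(gaze_point, board_length, board_width):
    """Return A(t): 1 if the gaze point lies inside the region P, else 0.

    gaze_point: (x, y) on the board plane, or None if no gaze was estimated.
    board_length, board_width: extents of P along X and Y, same unit as the point.
    """
    if gaze_point is None:
        return 0  # assumption: a frame without a gaze estimate counts as inattentive
    x, y = gaze_point
    inside = abs(x) <= board_length / 2 and abs(y) <= board_width / 2
    return 1 if inside else 0
```

Applying this function to every processed frame yields the binary attention signal A(t) used in the experiments.
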


4. EXPERIMENTAL RESULTS

To validate our point-of-gaze estimation method, we developed a test setup. A
standalone application that implements the various functionalities of the system was developed using the
Python language and the OpenCV library. It processes images (10 frames/second with a resolution of
1280 × 720 px) and returns results in real time on a workstation equipped with an Intel(R) Core(TM)
i7-8665U CPU and 16 GB of RAM.

Three students participated in this experiment. The width of each student’s face was measured for
calibration purposes. The students were seated at a distance of 1.5 m in front of the
display board (Figure 1). They were asked to look, in random order, at 9 locations marked on the board with
blue dots, for 10 seconds. The coordinates of the 9 points in the (X, Y) coordinate system were measured to
allow comparison with the estimated gaze points.

Figure 5 shows the difference between the estimated positions and the actual positions for the 9
points. The gaze point was taken to be the mean of the left-eye and right-eye gaze point coordinates (xi,
yi). To simplify the readability of the results, half the length of the display board was added to xi in order
to obtain positive values. The estimation error was calculated for each axis (X, Y), and the Euclidean distance
was used to measure the distance between the actual positions of the points displayed on the board and their
estimates.
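The error measures described above can be computed as in the following sketch. The function names and the sample coordinates are hypothetical; they are not the measurements of the experiment.

```python
import math


def combine_eyes(left_point, right_point):
    """Gaze point as the mean of the left-eye and right-eye gaze points."""
    return ((left_point[0] + right_point[0]) / 2,
            (left_point[1] + right_point[1]) / 2)


def gaze_errors(reference, estimated):
    """Mean absolute error on X, on Y, and mean Euclidean distance (same unit)."""
    n = len(reference)
    if n == 0 or n != len(estimated):
        raise ValueError("reference and estimated must be non-empty and equal in length")
    err_x = sum(abs(e[0] - r[0]) for r, e in zip(reference, estimated)) / n
    err_y = sum(abs(e[1] - r[1]) for r, e in zip(reference, estimated)) / n
    dist = sum(math.hypot(e[0] - r[0], e[1] - r[1])
               for r, e in zip(reference, estimated)) / n
    return err_x, err_y, dist


# Hypothetical reference and estimated points, in centimeters.
reference = [(0.0, 0.0), (10.0, 10.0)]
estimated = [(3.0, 4.0), (10.0, 15.0)]
errors = gaze_errors(reference, estimated)
```
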

Analysis of the experimental data shows an average error of 4.6 cm on the X axis and 4.3 cm on the Y
axis, which meets our expectations. Our system is able to find the 2D coordinates of the student’s
point of gaze on the slideshow plane at a distance of 1.5 m with acceptable precision. We reiterate that the
objective of this system is to identify, during the lecture, the student’s loss of attention,
which can be expressed by the gaze moving away from the display board or by a stagnant gaze that
indicates a thinking state or a wandering mind.

The accuracy of our method can be compared to that of Wan et al. [35] in terms of distance error
on the X and Y axes and of the distance separating the subject from the projection plane. However, the
hardware they used is more sophisticated than ours: a stereo camera and an infrared light
source to calculate the cornea center. The use of infrared light can be impractical in outdoor environments or
unhealthy under certain conditions [36]. The method of Gutiérrez et al. [37] employs an intrusive device,
since it requires a configuration that blocks head movement and the subject must wear a pair
of glasses on which the camera is mounted. This method achieves good accuracy, but only over a very short
distance that does not exceed 40 cm.


[Figure 5 plot: reference coordinates and estimated coordinates of the gaze points; axis ticks from 20 to 100]


Figure 5. Student’s gaze point estimation accuracy 


Table 2 compares these three previously cited methods with ours. Our method achieves its accuracy
without sophisticated equipment or infrared lighting, and the experiment was carried out under real
conditions. It should also be noted that the precision depends on the camera calibration and on the step that
calculates the distance between the camera and the student.


Table 2. Average error in cm of our method compared with that of the comparison methods

| Method | Material | Distance from the camera (cm) | Gaze error X (cm) | Gaze error Y (cm) |
|---|---|---|---|---|
| Miot et al. [37] | A camera and a motorized linear rail system to adjust the distance from the camera | Variable from 20 to 40 | 0.29 | 0.38 |
| Wan et al. [35] | Stereo camera and near-infrared light | 80-400 (150) | 0.6 | 0.6 |
| Miot et al. [37] and Wang et al. [38] | Camera | 86 | 0.25 | 0.2 |
| Our proposition | Single camera | 150 | 4.6 | 4.3 |


For the rest of the experiment, the participants were invited to follow a 10-minute class session
without instructions. A presentation was projected in front of the students at the location defined as plane P.
The images of the students were captured at a rate of 10 frames/second to lighten the computation.
The direction of gaze was estimated for each student and the state of attention was detected. Figure 6 shows 3
minutes of the recorded signal of one participant’s detected attention and distraction states.
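Once the per-frame attention states are available, the moments of loss of attention can be extracted from the binary signal. The sketch below is one possible post-processing step, not part of the described system: the function names are hypothetical, and the default of 10 frames/second matches the capture rate stated above.

```python
def distraction_episodes(signal, fps=10):
    """Return (start_time_s, duration_s) for each run of 0s in the signal."""
    episodes = []
    start = None
    for index, state in enumerate(signal):
        if state == 0 and start is None:
            start = index  # a distraction episode begins
        elif state == 1 and start is not None:
            episodes.append((start / fps, (index - start) / fps))
            start = None
    if start is not None:  # the signal ends while the student is distracted
        episodes.append((start / fps, (len(signal) - start) / fps))
    return episodes


def attention_ratio(signal):
    """Fraction of frames in which the student was attentive."""
    return sum(signal) / len(signal) if signal else 0.0


# Hypothetical 5-second signal at 10 frames/second.
signal = [1] * 20 + [0] * 15 + [1] * 10 + [0] * 5
episodes = distraction_episodes(signal)
ratio = attention_ratio(signal)
```

Very short episodes may correspond to blinks or detection failures rather than real distraction, so a minimum duration threshold could be applied before interpreting them.
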


[Figure 6 plot: attention state (vertical axis) versus frame number, from 0 to 3000 (horizontal axis)]


Figure 6. Predicted student’s signal attention based on his gaze point


5. CONCLUSION

In this article, we have presented a method for tracking student attention during a
classroom session by relying on the information that the gaze can disclose. Our approach is based on
estimating the student’s gaze point on the display board or slideshow to determine whether the
student follows the explanations given by his teacher or is distracted by other elements present in the
classroom. In future work, we will try to combine gaze point tracking with head pose estimation in order
to analyze more attention situations, such as when the student reads or writes his notes,
turns towards a classmate or looks at the ceiling.